Question 1. You are looking at a collaborator’s R code on GitHub, download the repository, and start exploring the code. The first line of the script is setwd("C:/Users/…")

- What is the author of this code trying to do with the function setwd()?

- Please discuss what is wrong with this approach in terms of reproducibility.

- Where is the working directory of an R project?

- Explain the concept of relative file paths. Is the author of this code using relative file paths?

The author is using setwd() to set the working directory, the folder where R looks for files to read and where it saves output by default. The function itself is not wrong, but problems arise because the path is absolute and specific to the author's computer: it runs through his “Users” folder, which almost certainly does not exist on anyone else's machine, so setwd() returns an error for every other user. Hard-coding a new working directory also means that relative file references written for a different working directory stop resolving, and the location where .RData is saved on exit moves to the new directory. All of this hurts reproducibility because no one else can follow his file path without editing the code. In an R project, the working directory is the project's root folder, the one that contains the .Rproj file, and it is set automatically when the project is opened. A relative file path tells R where to find a file starting from the working directory, whereas an absolute file path spells out the entire location on one particular computer. Relative paths are often shorter, easier to follow (if everything is where it should be), and more reproducible between users and devices. The author is using an absolute file path rather than a relative one, which makes his work harder to reproduce.
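As a minimal sketch of the difference, the code below builds a hypothetical project folder inside a temporary directory (the folder name my_project and the file example.csv are made up for illustration) and reads the same file through a relative path. The temporary setwd() call only stands in for what opening an RStudio project does automatically.

```r
# Hypothetical project layout, created in a temporary folder for illustration
project_dir <- file.path(tempdir(), "my_project")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)
write.csv(data.frame(x = 1:3),
          file.path(project_dir, "data", "example.csv"),
          row.names = FALSE)

# Stand-in for opening the project: make the project root the working directory
old_wd <- setwd(project_dir)

# Relative path: resolved against the working directory, so it is the same
# on every computer that has the project folder
rel_path <- file.path("data", "example.csv")
dat <- read.csv(rel_path)

# Absolute path: the full, machine-specific location of the same file
abs_path <- normalizePath(rel_path)

# Restore the original working directory
setwd(old_wd)

print(rel_path)
print(abs_path)
```

Only rel_path would be safe to write into a shared script; abs_path changes from one computer to the next.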

Question 2. What does the acronym FAIR stand for in the context of this class? Explain how R, GitHub, and other lecture concepts introduced in this course specifically help complete FAIR data principles.

The acronym FAIR stands for Findable, Accessible, Interoperable, and Reusable. Findable data matters for reproducibility because nothing can be done with data that cannot first be found. This means making data readily discoverable rather than keeping physical, handwritten copies under lock and key, guarded by a dragon.

Although he may look adorable!

Accessible data means other researchers can easily get to the data once they have found it. This is often done with online repositories such as GitHub and Ag Data Commons, where people can find your data along with its license and so know whether and how they may use it. If your data is stored only in your personal cloud account, on your computer, or at another private access point, others cannot use it. Similarly, if only you can read it, it is not much help to anyone else: data and analyses need to be in a form that other people and other software can read and interpret. This matters just as much as the other points in FAIR. Making sure others can read and follow your process is crucial to understanding and reproducing it, and tools such as R Markdown, or even just an annotated script, make interpreting your work much easier. Finally, reusable data is the whole goal of reproducible research. Other researchers should be able to find, access, and interpret your data easily so it can be reused for further projects. If no one can follow what you did and redo it step by step, then your reproducibility has failed somewhere.

Question 3. Explain the concept of R packages. What are R packages? Who writes R packages? What is the difference between installing and loading a package? Explain two ways to install and load packages into R.

R packages are collections of pre-written functions, code, documentation, and data that extend R beyond its basic functions. Anyone can write an R package, but they are typically written by programmers before you who needed code for a similar purpose. I like to compare it to books in a library. You could do your own research on a topic, say bears, and write your own book, but there is probably a book out there by a previous bear author that includes everything you need. Installing a package is like going to the bookstore to buy the bear book: you are acquiring what you need, and you only have to do it once. Loading a package is like taking the book off your bookshelf. It is always on your bookshelf now (installed), but you have to take it down each time you want to use it (loading), which means once per R session. To install a package you can just type:

install.packages("PackageName")

or

Select Tools > Install Packages

Select Repository (CRAN) in the Install from: slot

Type the package name

Click Install

If you want to load a package, you can type:

library(PackageName)

or

Select the Packages tab in the bottom-right pane of RStudio and check the box next to the package you want to load.
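The install-once, load-every-session distinction can be sketched as a small helper. The function name load_or_install is hypothetical, not part of R; it relies only on the base functions requireNamespace(), install.packages(), and library(). The example call uses the stats package, which ships with R, so nothing is downloaded.

```r
# Hypothetical helper: install a package only if it is missing, then load it
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)              # one-time download from CRAN
  }
  library(pkg, character.only = TRUE)  # needed again in every new R session
  invisible(pkg %in% loadedNamespaces())
}

# "stats" is bundled with R, so this only performs the loading step
loaded <- load_or_install("stats")

# A single function can also be used without loading the whole package
# by prefixing it with the package name
middle <- stats::median(c(1, 2, 3))
```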

Question 4. Explain the following concepts of ggplot and give examples of each concept using code and figures generated with ggplot using the data of your choosing.

- Layering

Layering in ggplot builds a plot by stacking layers on top of one another, where each layer combines data, an aesthetic mapping, a statistic, and a geometric object. Individual layers can be added or adjusted independently to build a complex visualization that portrays your data in the best way.

library(ggplot2)
Limo <- read.csv("LimoGHR.csv")

ggplot(Limo, aes(x = Time, y = Total, color = Trt)) +
  geom_point() # this shows all the data points in this set

ggplot(Limo, aes(x = Time, y = Total, color = Trt)) +
  geom_point() +
  geom_smooth() # with this layer included, the general trend lines are now visible
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

- Scales

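Scales control how data values are mapped onto the plot. For the x and y axes, they let you set the limits and breaks and transform the axis, for example with a log or square-root transformation. Scales also control other aesthetics, such as which colors represent each treatment.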

ggplot(Limo, aes(x = Time, y = Total, color = Trt)) +
  geom_point() +
  geom_smooth() +
  xlim(0, 15) + # points outside these limits are dropped before plotting
  ylim(0, 3000)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
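The log transformation mentioned above is not shown in the example, so here is a minimal sketch of it. It uses the built-in ToothGrowth data rather than the Limo data so that it runs without any external file; the object names p_log and p_zoom are made up for illustration. The second plot uses coord_cartesian(), which zooms in on a range without discarding the points outside it, unlike xlim() and ylim().

```r
library(ggplot2)

# Built-in data set, so no external file is needed
data("ToothGrowth")

# Transform the y axis and choose the x-axis breaks with scale functions
p_log <- ggplot(ToothGrowth, aes(x = dose, y = len, color = supp)) +
  geom_point() +
  scale_x_continuous(breaks = c(0.5, 1, 2)) + # label only the doses tested
  scale_y_log10()                             # log10 axis; the data are unchanged

# Zoom in on part of the y axis without removing any observations
p_zoom <- ggplot(ToothGrowth, aes(x = dose, y = len, color = supp)) +
  geom_point() +
  coord_cartesian(ylim = c(0, 20))

p_log
p_zoom
```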

- Themes

Themes work much like they do in other programs such as Microsoft Word or PowerPoint. They control the non-data appearance of a plot, such as the background, gridlines, and fonts, through a set of pre-set values. You could style your graphs by hand to look the same way, but someone already has, so it is easier to just use a theme.

ggplot(Limo, aes(x = Time, y = Total, color = Trt)) +
  geom_point() # this is the generic ggplot graph

ggplot(Limo, aes(x = Time, y = Total, color = Trt)) +
  geom_point() +
  theme_classic() # this theme removes the gridlines and background color!

- Facets

Facets split your data into subsets based on one or more factors and draw each subset in its own panel, rather than putting all the data on one graph. This often makes patterns within each group easier to see.

ggplot(Limo, aes(x = Time, y = Total, color = Trt)) +
  geom_point() +
  geom_smooth() # here is the same graph from earlier
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

ggplot(Limo, aes(x = Time, y = Total, color = Trt)) +
  geom_point() +
  geom_smooth() +
  facet_grid(. ~ Trt) # notice how the graphs are broken down by treatment
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Question 5. Explain the differences and similarities between a vector, matrix, and dataframe. Demonstrate you know how to subset a dataframe in two ways using the built in dataset ‘ToothGrowth’ with the prompts below:

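A vector is a one-dimensional sequence of values that all share the same type. It is the simplest data structure in R, and the commas only appear when you type one in with c(). A matrix is a two-dimensional table of rows and columns that, like a vector, can hold only one type of data. A dataframe is also a two-dimensional table, but each column can hold a different type of data, which makes it the most common structure for real data sets. All three can be subset with square brackets, and each column of a dataframe is itself a vector.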
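A short sketch of the three structures, using made-up values (the object names doses, mixed, m, and df are only for illustration):

```r
# Vector: one dimension, one data type
doses <- c(0.5, 1, 2)

# Mixing types in a vector coerces every element to a single type
mixed <- c(0.5, "VC")   # both elements become character strings

# Matrix: rows and columns, but still only one data type
m <- matrix(1:6, nrow = 2, ncol = 3)

# Dataframe: rows and columns, and each column can have its own type
df <- data.frame(
  supp = c("VC", "OJ", "VC"),
  dose = doses,
  stringsAsFactors = FALSE
)

class(mixed)       # the type shared by every element of the vector
dim(m)             # number of rows, then number of columns
sapply(df, class)  # one type per column
```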

- Subset ToothGrowth to include rows such that supp is equal to VC

data("ToothGrowth")

subset(ToothGrowth, supp == 'VC')
##     len supp dose
## 1   4.2   VC  0.5
## 2  11.5   VC  0.5
## 3   7.3   VC  0.5
## 4   5.8   VC  0.5
## 5   6.4   VC  0.5
## 6  10.0   VC  0.5
## 7  11.2   VC  0.5
## 8  11.2   VC  0.5
## 9   5.2   VC  0.5
## 10  7.0   VC  0.5
## 11 16.5   VC  1.0
## 12 16.5   VC  1.0
## 13 15.2   VC  1.0
## 14 17.3   VC  1.0
## 15 22.5   VC  1.0
## 16 17.3   VC  1.0
## 17 13.6   VC  1.0
## 18 14.5   VC  1.0
## 19 18.8   VC  1.0
## 20 15.5   VC  1.0
## 21 23.6   VC  2.0
## 22 18.5   VC  2.0
## 23 33.9   VC  2.0
## 24 25.5   VC  2.0
## 25 26.4   VC  2.0
## 26 32.5   VC  2.0
## 27 26.7   VC  2.0
## 28 21.5   VC  2.0
## 29 23.3   VC  2.0
## 30 29.5   VC  2.0

- Subset ToothGrowth to include rows such that supp is equal to VC and dose is equal to 0.5

subset(ToothGrowth, supp == 'VC' & dose == 0.5) # dose is numeric, so compare it to a number, not a string
##     len supp dose
## 1   4.2   VC  0.5
## 2  11.5   VC  0.5
## 3   7.3   VC  0.5
## 4   5.8   VC  0.5
## 5   6.4   VC  0.5
## 6  10.0   VC  0.5
## 7  11.2   VC  0.5
## 8  11.2   VC  0.5
## 9   5.2   VC  0.5
## 10  7.0   VC  0.5

- Subset ToothGrowth to include the values of len such that supp is equal to VC and dose is equal to 0.5

subset(ToothGrowth, supp == 'VC' & dose == 0.5, select = len) # filter the rows first, then keep only the len column
##     len
## 1   4.2
## 2  11.5
## 3   7.3
## 4   5.8
## 5   6.4
## 6  10.0
## 7  11.2
## 8  11.2
## 9   5.2
## 10  7.0
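The question asks for two ways of subsetting, and the calls above all use subset(). A second way, sketched here, is square-bracket indexing with a logical condition. The object names keep, vc_rows, vc_low, and vc_low_len are made up for illustration.

```r
data("ToothGrowth")

# Rows where supp is VC (all columns kept)
vc_rows <- ToothGrowth[ToothGrowth$supp == "VC", ]

# Logical condition reused below: vitamin C at the 0.5 dose
keep <- ToothGrowth$supp == "VC" & ToothGrowth$dose == 0.5

# Rows where supp is VC and dose is 0.5
vc_low <- ToothGrowth[keep, ]

# Only the len values for those rows, returned as a numeric vector
vc_low_len <- ToothGrowth[keep, "len"]
vc_low_len
```

Note that the bracket version returns len as a plain vector, while subset() with select returns a one-column dataframe.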

Question 6. Create an R markdown version of your answer to question 4 and 5. Save the .Rmd file to your computer and render it as a word document (.docx), .html, and a .md file. Push these files to your github and paste your github url here.

I have done my whole exam in R Markdown, so this question should be complete! Let me know if it is not. My GitHub is:

https://github.com/jmm0206/ENTM6820

Question 7. What is the correct order of events to get your code on github through R studio? Explain each step from creation of a repository to pushing.

On your GitHub repositories page, select the green “New” button. Here you can name your new repository and add a description for it. You can choose whether the repository is public or private, as well as the license you would like to apply. The license is important because it states legally how others may use your work.

Next, connect the repository to RStudio by cloning it: copy the repository URL from GitHub, then in RStudio choose File > New Project > Version Control > Git, paste the URL, and create the project. This puts a local copy of the repository on your computer. When you have a file you would like to push to your repository, make sure it is saved inside that project folder and then select the Git tab in the upper-right pane. Check the box next to the file to stage it and select Commit at the top of that pane. Another window will open where you should write a commit message describing what the file is or what changed, and then click Commit below the message box. If that returns no errors, choose the green Push arrow to push the commit to GitHub.

Question 8. After you have worked on a project for a while, you mistakenly delete a file on your github, while it still exists in your local repository (on your computer). Now when you try to push your code to github the push is rejected and gives the following error, “Updates were rejected because the remote contains work that you do not have locally.” How do you solve this error?

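Based on others' interactions with this error online, the best way to resolve it is to pull first, so that the remote changes are merged into your local repository, and then push. Pulling also brings the deletion down to your computer, so if you still want the file you may need to restore it and commit it again after the merge. Another suggested option is to run (git push -f origin master), which is a forced push that overwrites the commits on the remote with your local history. Because it discards whatever is on GitHub that you do not have locally, it should be a last resort, especially on a shared repository.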

Question 9. Explain the purpose of a Data Management Plan.

A Data Management Plan (DMP) gives everyone involved in a project clear guidelines about who will manage the data, how, and where. It can increase efficiency and organization and aid in clearer communication. It is also often a requirement from funding agencies. A DMP lays out what kind of data will be collected, where it will be stored, where it will be shared, and who is responsible for each of these steps. It is a formal document that lays out the important details of data management precisely, so everyone involved can refer to it and work from the same terms.

Question 10. A colleague gives you data in an .xlsx file that looks like this:

Please discuss at least five things wrong with how these data are formatted that make it not reproducibility friendly. Then describe/show your colleague how the data should be formatted.

1. It is best to store data in a CSV file, which saves it as plain text. Excel files can be altered or become corrupted over time or between users, and they depend on particular software to open, so a plain-text format is more reliable.

2. The colors used in this data set are very unfriendly to colorblind readers, and any meaning carried by cell color is lost when the file is saved as a CSV. If colors are necessary or desired, he should use a colorblind-friendly palette, and anything the colors encode should also be recorded in its own column.

3. The “AvgStandTreatment#” labels describe the data very poorly. I have no idea what is being measured here. I assume the Treatment# correlates with different treatments, but again, I have no idea what type of data is being collected. Units and treatments should be defined more clearly.

4. To load data properly, it should sit under one consistent set of headers. He needs to rearrange the data into a single, complete table with one header row, one column per variable, and one row per observation.

5. There are also multiple sheets in the Excel file. These will not translate to a CSV properly. When organizing data for a CSV or text file, all of the data should be on the same sheet.
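To show what the recommended layout could look like, here is a minimal sketch. Every column name and value in it is made up for illustration, because the colleague's actual variables and measurements are not known. The temporary file path is also hypothetical.

```r
# Hypothetical tidy layout: one header row, one column per variable,
# one row per observation. All names and values are invented placeholders.
tidy <- data.frame(
  treatment   = c("control", "control", "treated", "treated"),
  replicate   = c(1, 2, 1, 2),
  stand_count = c(42, 39, 51, 48),
  stringsAsFactors = FALSE
)

# Save as plain-text CSV (a single table, no colors, no extra sheets)
csv_path <- file.path(tempdir(), "stand_counts.csv")
write.csv(tidy, csv_path, row.names = FALSE)

# The file loads straight back into R with the same structure
reloaded <- read.csv(csv_path)
str(reloaded)
```

A layout like this can be read directly by read.csv() and passed to ggplot or subset() without any manual cleanup.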